{
    "cells": [
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {
                "scrolled": true
            },
            "outputs": [],
            "source": [
                "# download presidio\n",
                "!pip install presidio_analyzer presidio_anonymizer\n",
                "!python -m spacy download en_core_web_lg"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "###### Path to notebook: [https://www.github.com/microsoft/presidio/blob/main/docs/samples/python/presidio_notebook.ipynb](https://www.github.com/microsoft/presidio/blob/main/docs/samples/python/presidio_notebook.ipynb)"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "from presidio_analyzer import AnalyzerEngine, PatternRecognizer\n",
                "from presidio_anonymizer import AnonymizerEngine\n",
                "from presidio_anonymizer.entities import OperatorConfig\n",
                "import json\n",
                "from pprint import pprint"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "# Analyze Text for PII Entities\n",
                "\n",
                "Using Presidio Analyzer, analyze a text to identify PII entities. \n",
                "The Presidio analyzer is using pre-defined entity recognizers, and offers the option to create custom recognizers.\n",
                "\n",
                "The following code sample will:\n",
                "\n",
                "- Set up the Analyzer engine: load the NLP module (spaCy model by default) and other PII recognizers\n",
                "- Call analyzer to get analyzed results for \"PHONE_NUMBER\" entity type\n"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "text_to_anonymize = \"His name is Mr. Jones and his phone number is 212-555-5555\""
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "analyzer = AnalyzerEngine()\n",
                "analyzer_results = analyzer.analyze(text=text_to_anonymize, entities=[\"PHONE_NUMBER\"], language='en')\n",
                "\n",
                "print(analyzer_results)"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## Create Custom PII Entity Recognizers\n",
                "\n",
                "Presidio Analyzer comes with a pre-defined set of entity recognizers. It also allows adding new recognizers without changing the analyzer base code, **by creating custom recognizers**. \n",
                "In the following example, we will create two new recognizers of type `PatternRecognizer` to identify titles and pronouns in the analyzed text.\n",
                "A `PatternRecognizer` is a PII entity recognizer which uses regular expressions or deny-lists.\n",
                "\n",
                "The following code sample will:\n",
                "- Create custom recognizers\n",
                "- Add the new custom recognizers to the analyzer\n",
                "- Call analyzer to get results from the new recognizers"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "titles_recognizer = PatternRecognizer(supported_entity=\"TITLE\",\n",
                "                                      deny_list=[\"Mr.\",\"Mrs.\",\"Miss\"])\n",
                "\n",
                "pronoun_recognizer = PatternRecognizer(supported_entity=\"PRONOUN\",\n",
                "                                       deny_list=[\"he\", \"He\", \"his\", \"His\", \"she\", \"She\", \"hers\", \"Hers\"])\n",
                "\n",
                "analyzer.registry.add_recognizer(titles_recognizer)\n",
                "analyzer.registry.add_recognizer(pronoun_recognizer)\n",
                "\n",
                "analyzer_results = analyzer.analyze(text=text_to_anonymize,\n",
                "                            entities=[\"TITLE\", \"PRONOUN\"],\n",
                "                            language=\"en\")\n",
                "print(analyzer_results)\n"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "Call Presidio Analyzer and get analyzed results with all the configured recognizers - default and new custom recognizers"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "analyzer_results = analyzer.analyze(text=text_to_anonymize, language='en')\n",
                "\n",
                "analyzer_results"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "# Anonymize Text with Identified PII Entities\n",
                "\n",
                "<br>Presidio Anonymizer iterates over the Presidio Analyzer result, and provides anonymization capabilities for the identified text.\n",
                "<br>The anonymizer provides 5 types of anonymizers - replace, redact, mask, hash and encrypt. The default is **replace**\n",
                "\n",
                "<br>The following code sample will:\n",
                "<ol>\n",
                "<li>Setup the anonymizer engine </li>\n",
                "<li>Create an anonymizer request - text to anonymize, list of anonymizers to apply and the results from the analyzer request</li>\n",
                "<li>Anonymize the text</li>\n",
                "</ol>"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "anonymizer = AnonymizerEngine()\n",
                "\n",
                "anonymized_results = anonymizer.anonymize(\n",
                "    text=text_to_anonymize,\n",
                "    analyzer_results=analyzer_results,    \n",
                "    operators={\"DEFAULT\": OperatorConfig(\"replace\", {\"new_value\": \"<ANONYMIZED>\"}), \n",
                "                        \"PHONE_NUMBER\": OperatorConfig(\"mask\", {\"type\": \"mask\", \"masking_char\" : \"*\", \"chars_to_mask\" : 12, \"from_end\" : True}),\n",
                "                        \"TITLE\": OperatorConfig(\"redact\", {})}\n",
                ")\n",
                "\n",
                "print(f\"text: {anonymized_results.text}\")\n",
                "print(\"detailed response:\")\n",
                "\n",
                "pprint(json.loads(anonymized_results.to_json()))"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": []
        }
    ],
    "metadata": {
        "kernelspec": {
            "display_name": "presidio",
            "language": "python",
            "name": "presidio"
        },
        "language_info": {
            "codemirror_mode": {
                "name": "ipython",
                "version": 3
            },
            "file_extension": ".py",
            "mimetype": "text/x-python",
            "name": "python",
            "nbconvert_exporter": "python",
            "pygments_lexer": "ipython3",
            "version": "3.7.9"
        },
        "metadata": {
            "interpreter": {
                "hash": "1baa965d5efe3ac65b79dfc60c0d706280b1da80fedb7760faf2759126c4f253"
            }
        }
    },
    "nbformat": 4,
    "nbformat_minor": 2
}
